Clustering is one of the hardest problems in ML.
https://www.scribd.com/document/380399478/Teuvo-Kohonen-Self-Organizing-Maps
Also mention: Vector Quantization
Teuvo Kohonen, 1995: Self-Organizing Maps
Image('../images/Kohonen.png', width=320, height=420)
In clustering we try to find observations that are near to each other and farther away from dissimilar observations. The notion of 'similar' only makes sense once we pick out some variables that seem salient for comparing observations.
Most often we use distance measures, for example the Euclidean distance:
\begin{equation*} d(m, n) = \sqrt{\sum_{i=1}^{p} (m_i - n_i)^2} \end{equation*}
Here, $m$ and $n$ are two observations with measures on $p$ different variables. The Euclidean distance is well known from the Pythagorean theorem $a^2 + b^2 = c^2$; it is just its extension to more than two dimensions.
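As a minimal numeric sketch of this distance (the function name is my own, not from a library):

```python
import numpy as np

def euclidean(m, n):
    # square root of the summed squared coordinate differences
    return np.sqrt(np.sum((np.asarray(m) - np.asarray(n)) ** 2))

# in two dimensions this is exactly the Pythagorean theorem: a=3, b=4 -> c=5
print(euclidean([0, 0], [3, 4]))   # -> 5.0
```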
What happens when the variables are scaled differently? Consider, for example:
absolute spending in different categories vs. relative (percentage of) spending in different categories
absolute spending:
|   | a | b | c |
|---|---|---|---|
| a | 0.000000 | 22219.023381 | 21100.947846 |
| b | 22219.023381 | 0.000000 | 28928.100525 |
| c | 21100.947846 | 28928.100525 | 0.000000 |
relative spending:
|   | a | b | c |
|---|---|---|---|
| a | 0.000000 | 0.227983 | 0.252631 |
| b | 0.227983 | 0.000000 | 0.410489 |
| c | 0.252631 | 0.410489 | 0.000000 |
As long as all variables have the same meaning and the same scale, clustering is still straightforward. But what do we do if we have different variables with different scales? Consider adding the age of the customers. A difference in age of 40 years adds little to the distance of absolute spending (in thousands of CHF), but will dominate the distance measure for relative spending (percentages between 0 and 1).
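A minimal sketch of this effect and the usual remedy, z-scoring each variable (the numbers are made up for illustration):

```python
import numpy as np

# hypothetical customers: absolute spending in CHF and age in years
X = np.array([[25000.0, 25],
              [26000.0, 65],    # 40 years older, spends about the same
              [50000.0, 25]])   # same age, spends twice as much

# raw Euclidean distances: the spending column dominates completely
d_raw_01 = np.linalg.norm(X[0] - X[1])
d_raw_02 = np.linalg.norm(X[0] - X[2])

# z-scoring each column puts both variables on a comparable scale
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = np.linalg.norm(Xz[0] - Xz[1])
d_std_02 = np.linalg.norm(Xz[0] - Xz[2])

print(d_raw_01, d_raw_02)   # the 40-year age gap barely registers in the raw distance
print(d_std_01, d_std_02)   # after standardization both differences matter
```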
Spectral clustering originates from graph theory (spectral analysis). The algorithm works as follows:
The Laplacian matrix $L$ is formed from the degree matrix $D$ and the adjacency (connectivity) matrix $W$ as $L = D - W$. This matrix has several interesting properties:
If there are $r$ connected components (clusters) in the data and the observations are ordered accordingly, the Laplacian has block diagonal form: \begin{equation*} \begin{bmatrix} L_{1} & & \\ & \ddots & \\ & & L_{r} \end{bmatrix} \end{equation*}
The blocks $L_i$ are proper Laplacian matrices in their own right. Since each component (cluster) is connected within itself, $L_i$ has exactly one eigenvalue that is 0. The corresponding eigenvector is constant on its component and zero for all other components. Hence these eigenvectors are indicator vectors for their component.
# this simple example is taken from https://towardsdatascience.com/unsupervised-machine-learning-spectral-clustering-algorithm-implemented-from-scratch-in-python-205c87271045
import numpy as np
import matplotlib.pyplot as plt
X = np.array([
[1, 3], [2, 1], [1, 1], [3, 2], [7, 8], [9, 8], [9, 9], [8, 7], [13, 14], [14, 14], [15, 16], [14, 15]
])
plt.scatter(X[:,0], X[:,1], alpha=0.7, edgecolors='b')
plt.xlabel('Weight')
plt.ylabel('Height')
Next we build the connectivity matrix $W$ (also called adjacency matrix). We see it has three connected components (clusters):
from sklearn.metrics import pairwise_distances
W = pairwise_distances(X, metric="euclidean")
vectorizer = np.vectorize(lambda x: 1 if x < 5 else 0)
W = vectorizer(W)  # threshold the distances: 1 = connected, 0 = not connected
print(W)
[[1 1 1 1 0 0 0 0 0 0 0 0]
 [1 1 1 1 0 0 0 0 0 0 0 0]
 [1 1 1 1 0 0 0 0 0 0 0 0]
 [1 1 1 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 1 1 1 0 0 0 0]
 [0 0 0 0 1 1 1 1 0 0 0 0]
 [0 0 0 0 1 1 1 1 0 0 0 0]
 [0 0 0 0 1 1 1 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 1 1 1]
 [0 0 0 0 0 0 0 0 1 1 1 1]
 [0 0 0 0 0 0 0 0 1 1 1 1]
 [0 0 0 0 0 0 0 0 1 1 1 1]]
Next we build the degree matrix $D$ and finally the Laplacian as $L = D - W$.
#D = np.diag(np.reshape(W.dot(np.ones((W.shape[1], 1))), (W.shape[1],)))
D = np.diag(W.sum(axis=1))
print(D)
[[4 0 0 0 0 0 0 0 0 0 0 0]
 [0 4 0 0 0 0 0 0 0 0 0 0]
 [0 0 4 0 0 0 0 0 0 0 0 0]
 [0 0 0 4 0 0 0 0 0 0 0 0]
 [0 0 0 0 4 0 0 0 0 0 0 0]
 [0 0 0 0 0 4 0 0 0 0 0 0]
 [0 0 0 0 0 0 4 0 0 0 0 0]
 [0 0 0 0 0 0 0 4 0 0 0 0]
 [0 0 0 0 0 0 0 0 4 0 0 0]
 [0 0 0 0 0 0 0 0 0 4 0 0]
 [0 0 0 0 0 0 0 0 0 0 4 0]
 [0 0 0 0 0 0 0 0 0 0 0 4]]
L = D - W
print(L)
[[ 3 -1 -1 -1  0  0  0  0  0  0  0  0]
 [-1  3 -1 -1  0  0  0  0  0  0  0  0]
 [-1 -1  3 -1  0  0  0  0  0  0  0  0]
 [-1 -1 -1  3  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  3 -1 -1 -1  0  0  0  0]
 [ 0  0  0  0 -1  3 -1 -1  0  0  0  0]
 [ 0  0  0  0 -1 -1  3 -1  0  0  0  0]
 [ 0  0  0  0 -1 -1 -1  3  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  3 -1 -1 -1]
 [ 0  0  0  0  0  0  0  0 -1  3 -1 -1]
 [ 0  0  0  0  0  0  0  0 -1 -1  3 -1]
 [ 0  0  0  0  0  0  0  0 -1 -1 -1  3]]
# eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(L)
# sort these based on the eigenvalues
eigenvectors = eigenvectors[:,np.argsort(eigenvalues)]
eigenvalues = eigenvalues[np.argsort(eigenvalues)]
plt.scatter(np.arange(1, W.shape[1]+1), eigenvalues, alpha=0.7, edgecolors='b')
plt.xlabel('index of sorted eigenvalue')
plt.ylabel('eigenvalues')
print('corresponding eigenvectors')
print(eigenvectors[:, 0:3])
corresponding eigenvectors
[[-0.5  0.   0. ]
 [-0.5  0.   0. ]
 [-0.5  0.   0. ]
 [-0.5  0.   0. ]
 [ 0.  -0.5  0. ]
 [ 0.  -0.5  0. ]
 [ 0.  -0.5  0. ]
 [ 0.  -0.5  0. ]
 [ 0.   0.  -0.5]
 [ 0.   0.  -0.5]
 [ 0.   0.  -0.5]
 [ 0.   0.  -0.5]]
This is a somewhat contrived example, but it illustrates the principal idea. For normal data that is not as clear-cut as our example, the spectral embedding matrix (the eigenvectors belonging to the $s$ smallest eigenvalues) can be clustered by k-means. Another possibility is to start with the embedding matrix as an indicator matrix and optimize it so as to maximize the in-component connectivity (association) and minimize the between-component connectivity (cuts) (see https://www1.icsi.berkeley.edu/~stellayu/publication/doc/2003kwayICCV.pdf). In the Python sklearn library the latter method is called 'discretize'.
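A minimal sketch of the k-means variant, recomputing the toy data from above so the cell is self-contained:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

# same toy data as above: three well-separated blobs
X = np.array([[1, 3], [2, 1], [1, 1], [3, 2], [7, 8], [9, 8], [9, 9], [8, 7],
              [13, 14], [14, 14], [15, 16], [14, 15]])

W = (pairwise_distances(X) < 5).astype(int)   # thresholded adjacency matrix
L = np.diag(W.sum(axis=1)) - W                # graph Laplacian
vals, vecs = np.linalg.eigh(L)                # eigh: L is symmetric, eigenvalues ascending
embedding = vecs[:, :3]                       # eigenvectors of the 3 smallest eigenvalues

# rows of the embedding are (near-)identical within a component, so k-means separates them
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```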
We are going to demonstrate the virtues of spectral clustering on a text-clustering example. The data is a common dataset with nearly balanced classes. Hence, k-means should also be appropriate.
Here is one example:
From: sigma@rahul.net (Kevin Martin)
Subject: Re: Stay Away from MAG Innovision!!!
Nntp-Posting-Host: bolero
Organization: a2i network
Lines: 10
In <16BB58B33.D1SAR@VM1.CC.UAKRON.EDU> D1SAR@VM1.CC.UAKRON.EDU (Steve Rimar) writes:
>My Mag MX15F works fine....................
Mine was beautiful for a year and a half. Then it went <foomp>. I bought
a ViewSonic 6FS instead. Another great monitor, IMHO.
--
Kevin Martin
sigma@rahul.net
"I gotta get me another hat."
Image("../images/20_newsgroups.png")
#<img alt="taken from homepage of 20 newsgroups" caption="The different categories of the newsgroup posts" id="20_newsgroups" src="../images/20_newsgroups.png">
#%reload_ext autoreload
#%autoreload 2
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.cluster import KMeans
from sklearn import metrics
import nmslib
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
dataset = fetch_20newsgroups(subset='all', shuffle=True, download_if_missing=True)
# http://qwone.com/~jason/20Newsgroups/
np.random.seed(123)
texts = dataset.data # Extract text
target = dataset.target # Extract target
display(len(texts))
18846
Refresher:
$w_{i,j} = \text{tf}_{i,j} \cdot \log\frac{N}{\text{df}_i}$
where
$\text{tf}_{i,j}$ is the frequency of term $i$ in document $j$,
$\text{df}_i$ is the number of documents containing term $i$, and $N$ is the total number of documents.
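A tiny worked instance of this formula (the counts are made up; note that sklearn's TfidfVectorizer actually uses a smoothed variant of the idf term):

```python
import math

# hypothetical counts: term i occurs 3 times in document j,
# the corpus has N = 100 documents and 10 of them contain term i
tf_ij, N, df_i = 3, 100, 10

w_ij = tf_ij * math.log(N / df_i)   # tf * idf
print(round(w_ij, 3))               # 3 * ln(10) -> 6.908
```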
Since we want to cluster newsgroup posts, we are more interested in words that are nearly unique to certain groups. By setting max_df = 0.3 we ensure that only words appearing in at most 30% of all posts are considered. By contrast, words that are rare but concern the special topics discussed in the groups are most important for our endeavor. Hence, there is no lower limit via min_df.
We tell the vectorizer to remove common English stop words.
By default the vectors returned are normalized, i.e. they all have length 1. This is important when we compute the cosine similarity between the vectors.
The rand index is a measure of how well two partitions of a set of objects coincide:
\begin{equation*} R=\frac{a + b}{a + b + c + d} = \frac{a + b}{{n \choose2}}, \end{equation*}where
$a$ is the number of pairs of elements that are in the same subset for both partitions
$b$ is the number of pairs of elements that are in different subsets for both partitions
$c$ is the number of pairs of elements that are in the same subset for the first partition but not for the second
$d$ is the number of pairs of elements that are in different subsets for the first partition but in the same subset for the second partition
The Rand index is the percentage of consistent/congruent decisions for the two partitions.
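A minimal sketch of this definition in plain Python (the function name is my own):

```python
from itertools import combinations

def rand_index(p1, p2):
    # count pair decisions on which the two partitions agree
    agree = 0
    pairs = list(combinations(range(len(p1)), 2))
    for i, j in pairs:
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        if same1 == same2:      # both 'same subset' (a) or both 'different subsets' (b)
            agree += 1
    return agree / len(pairs)

# identical partitions up to label names -> Rand index 1.0
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # -> 1.0
```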
# norm='l2' is default
vectorizer = TfidfVectorizer(stop_words='english', max_df = 0.3)
X = vectorizer.fit_transform(texts)
print(f'{X.shape[0]}, {X.shape[1]}')
18846, 173438
We have several possibilities to express the similarity / connectedness between observations, e.g. a fully connected graph weighted by a kernel, an $\epsilon$-neighborhood graph, or a $k$-nearest-neighbor graph.
Since we deal with very long and sparse vectors, Euclidean distance is not a good choice. This has to do with the curse of dimensionality (search for it). Typically, cosine similarity is used for these long tf-idf vectors. For bigger data sets, I cannot see any reason why we should compute the similarities between all pairs of points in the set. It is probably sufficient to consider only the $k$ nearest neighbors of each point. This allows us to use sparse matrices, which are far more memory efficient. The sklearn spectral clustering implementation can deal with those matrices. As we will see, this simplification further reduces noise in the data and encourages better overall solutions.
Computing all mutual distances between the 18846 posts can be very time and memory consuming. Theoretically, we only have to compute the triangular matrix minus the diagonal (distance matrices are symmetric):
$0.5 \cdot 18846^2 - 18846/2 = 177576435$ mutual distances. A clever idea here is to compute only the $k$ nearest neighbors of each post and to treat all other posts as maximally distant. There are a lot of algorithms for computing nearest neighbors, the most efficient relying on k-d trees.
However, there are also approximate nearest-neighbor algorithms that are nearly 100% accurate and even faster than k-d trees. One of them is the HNSW algorithm as implemented in the Python package nmslib (an API for the underlying C++ code).
https://github.com/erikbern/ann-benchmarks
# see discussions about 'curse of dimensionality'; to be safe, we opt for cosine-similarity
# since we want to be most efficient in everything we do, we use sparse matrices and vectors
index = nmslib.init(method='hnsw', space='cosinesimil_sparse', data_type=nmslib.DataType.SPARSE_VECTOR)
index.addDataPointBatch(X)
index_time_params = {'post':2}
index.createIndex(index_time_params, print_progress=True)
# now we can query the index
# By computing distances for the k=1000 nearest neighbors, we have to store 1000 * 18846 = 18846000
# distances; but compared to the triangular matrix approach discussed above, we still save 158730435
# entries in our matrix.
nn = 1000
neighbors = index.knnQueryBatch(X, k=nn, num_threads=4)
Next, we build our affinity matrix. Sparse matrices only store the indices and the data for entries of the matrix that actually contain data.
from scipy.sparse import csc_matrix
# next we construct a sparse matrix with the distances as measured by cosine similarity
col = np.array([i for n in neighbors for i in n[0].tolist()]) # column indices of nearest neighbors
row = np.repeat(np.arange(0, len(neighbors)), np.array([len(n[0]) for n in neighbors])) # the row index for each nearest neighbor
# data = np.array([i for n in neighbors for i in n[1].tolist()]) # the actual cosine distances
data = np.array([1 for n in neighbors for i in n[1].tolist()]) # binary connectivity: 1 for each neighbor pair
# btw, if you do not understand what's going on in the list part of the construction of the col
# and the data variables above, I strongly recommend having a look at *list comprehensions*;
# list comprehensions are a super fast way to get things done in python
affinity = csc_matrix((data, (row, col)), shape = (X.shape[0], X.shape[0]))
# n_clusters = 20 because we know that there are 20 different newsgroups
# n_components = 21; this is a bit of a cheat because it should normally be set to the
# same number as n_clusters, but 21 yielded a better solution
# assign_labels = 'discretize' defines how the clusters are found within the
# embedding matrix (matrix composed of eigenvectors belonging to the n_components
# smallest eigenvalues). As discussed above, 'discretize' maximizes the
# in-component connectivity and minimizes the between-component connectivity
# affinity = 'precomputed' indicates that the affinity matrix is provided by the user
# eigen_solver = 'amg' assumes the python package pyamg is installed. It is by far the
# fastest way to do the eigenvalue decomposition
#
solution = SpectralClustering(n_clusters = 20, n_components = 21, \
assign_labels='discretize',\
affinity = 'precomputed',\
eigen_solver='amg'\
).fit(0.5 * (affinity + affinity.T))
metrics.adjusted_rand_score(solution.labels_, target)
0.46542605897043254
So, let's see what our nearest-neighbor trick was good for. In the next cell we compute the spectral clustering for the whole tf-idf matrix and leave it to sklearn to compute all cosine similarities. As discussed above, since the tf-idf vectors are normalized by default, we can use the 'linear' kernel instead of the 'cosine' kernel. This is faster because the additional steps of computing the norms and dividing are not needed.
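A quick sanity check of this equivalence on a toy corpus (not the newsgroup data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["the cat sat on the mat", "the cat ran away", "dogs bark loudly"]
X = TfidfVectorizer().fit_transform(docs)   # norm='l2' is the default

# every row already has unit length ...
row_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()
print(np.allclose(row_norms, 1.0))                           # -> True
# ... so the plain dot product (linear kernel) is already the cosine similarity
print(np.allclose(linear_kernel(X), cosine_similarity(X)))   # -> True
```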
# linear kernel instead of cosine_similarity if tf-idf vectors are already normalized
solution = SpectralClustering(n_clusters=20, n_components = 21,
assign_labels='discretize',\
affinity='linear',\
eigen_solver='amg').fit(X)
metrics.adjusted_rand_score(solution.labels_, target)
0.2755746277110436
What would the solution look like if we had used the ordinary Euclidean distance (i.e. the RBF or Gaussian kernel)? gamma ($\gamma = \frac{1}{2\sigma^2}$) defines the size of the neighborhood and plays a role similar to the parameter $k$ in the approximate k-nearest-neighbor algorithm we used to construct our affinity matrix.
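The RBF kernel in question is $\exp(-\gamma \lVert x - y \rVert^2)$; a quick check against sklearn with toy vectors of my own:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[0.0, 0.0]])
y = np.array([[1.0, 2.0]])
gamma = 0.7

# the kernel value decays with the squared Euclidean distance
manual = np.exp(-gamma * np.sum((x - y) ** 2))
print(np.allclose(rbf_kernel(x, y, gamma=gamma), manual))   # -> True
```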
solution = SpectralClustering(n_clusters=20, n_components = 21, assign_labels='discretize',\
affinity='rbf',eigen_solver='amg',\
gamma = 0.7).fit(X)
metrics.adjusted_rand_score(solution.labels_, target)
0.17360992022782784
How is ordinary k-means performing? Clustering is done on the tf-idf vectors.
solutionKMeans = KMeans(n_clusters=20, init='k-means++', max_iter=100, n_init=1).fit(X)
metrics.adjusted_rand_score(solutionKMeans.labels_, target)
0.09241246682493831
Could we have guessed the right number of clusters from the eigenvalues of the laplacian as suggested in all tutorials about spectral clustering?
# 'eigs' is assumed to hold the sorted eigenvalues of the Laplacian of the full
# affinity matrix (their computation is not shown in this notebook)
plt.scatter(np.arange(len(eigs[0:255])), eigs[0:255])
plt.plot([19] * 2, [eigs[19]] * 2, 'ro')
plt.grid()
DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
DBSCAN has the concepts of core point, border point, and outlier. A data point is considered a core point if there are sufficiently many (minPts) data points in its vicinity of radius eps. Border points are points that are connected to a core point but do not themselves have enough data points in their neighborhood. Outliers are points that are not connected to any core point.
The image is taken from Wikipedia.
A starting point is selected at random and its neighborhood is scanned. Then this is done for all points in the neighborhood, and so on. Next, a new starting point is chosen among the points not visited previously, until all points have been visited.
Core points that are connected to other core points form clusters.
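A tiny sketch of these concepts on made-up 1-D data (the parameters are illustrative only, not the ones used below):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# a dense run of four points plus one far-away point
pts = np.array([[0.0], [0.1], [0.2], [0.3], [10.0]])

# min_samples counts the point itself, so a core point needs 2 neighbors within eps
db = DBSCAN(eps=0.15, min_samples=3).fit(pts)

print(db.labels_)                 # [ 0  0  0  0 -1]; -1 marks the outlier
print(db.core_sample_indices_)    # [1 2]; the end points 0 and 3 are border points
```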
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.5, min_samples=14, metric='cosine', n_jobs=-1)
clusters = db.fit(X)
np.unique(clusters.labels_)
array([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20])
metrics.adjusted_rand_score(clusters.labels_, target)
8.75031255964891e-05